Data formats for phonological corpora
Authors
Abstract
The annotation of linguistic resources has long-standing traditions (see Cole et al., 2010). The other chapters of this book make clear that the production of annotated resources is a laborious, time-consuming, and expensive task. In theory, we want to make these resources available in such a way that they can be re-used by as many scholars as possible (see Ide & Romary, 2002). However, a large variety of annotation formats has been developed over the past decades, each one created for a specific research task. Consequently, the resulting resources are frequently usable only by members of the individual research projects. The goal of the present chapter is to explore the possibility of providing the research and industrial communities that commonly use spoken corpora with a set of well-documented standardised formats that allow a high re-use rate of annotated spoken resources and, as a consequence, better interoperability across the tools used to produce or exploit such resources. We hope to identify standards that cover all possible aspects of the management workflow of spoken data, from the actual representation of raw recordings and transcriptions to high-level content-related information at a semantic or pragmatic level. Most of the challenges here are similar to those for textual resources, with two exceptions: on the one hand, the grounding relation that spoken data has to illocutionary circumstances (time, place, speakers and addressees), and, on the other hand, the specific annotation layers that correspond to speech-related information (e.g. prosody), including multimodal aspects such as gestures.
We should also not forget, as is well illustrated in this book, the importance of legacy practices in the spoken corpora community, most of them resulting from the existence of specific tools at various representation layers, ranging from basic transcription tools (Transcriber, PRAAT) to generic score-based annotation tools. These various tools do not have the same maintenance rate and capacity, and it is therefore essential to think about standardised formats as offering the possibility of being embedded in existing practices. This implies that we have two basic scenarios in mind. First, we want to be able to project existing data into a range of standardised representations that bear as little specificity to the original format as possible but as much faithfulness as necessary. Second, we want standardised formats to have the capacity to be used for the development of new technical platforms, thus allowing the integration of new requirements and new features. These two general requirements both imply standards that can incorporate features and …
Similar resources
Comparability of lexical corpora: word frequency in phonological generalization.
Statistical regularities in language have been examined for new insight to the language acquisition process. This line of study has aided theory advancement, but it also has raised methodological concerns about the applicability of corpora data to child populations. One issue is whether it is appropriate to extend the regularities observed in the speech of adults to developing linguistic system...
Querying Both Time-aligned and Hierarchical Corpora with NXT Search
One obstacle to the (re-)usability and exchange of annotated corpora is the lack of standards in corpus formats and corpus query tools. This paper reports on the NXT Search tool, which was used to query two corpora with very different annotation formats. It is shown that with automatic data format conversion both corpora can be accessed and searched with NXT Search.
Polyglot and Speech Corpus Tools: A System for Representing, Integrating, and Querying Speech Corpora
Speech datasets from many languages, styles, and sources exist in the world, representing significant potential for scientific studies of speech—particularly given structural similarities among all speech datasets. However, studies using multiple speech corpora remain difficult in practice, due to corpus size, complexity, and differing formats. We introduce open-source software for unified corp...
Using The Web As A Phonological Corpus: A Case Study From Tagalog
Some languages’ orthographic properties allow written data to be used for phonological research. This paper reports on an on-going project that uses a web-derived text corpus to study the phonology of Tagalog, a language for which large corpora are not otherwise available. Novel findings concerning the phenomenon of intervocalic tapping are discussed in detail, and an overview of other phonolog...
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer
It is well known that word aligned parallel corpora are valuable linguistic resources. Since many factors affect automatic alignment quality, manual post-editing may be required in some applications. While there are several state-of-the-art word-aligners, such as GIZA++ and Berkeley, there is no simple visual tool that would enable correcting and editing aligned corpora of different formats. We...
Journal: CoRR
Volume: abs/1110.1758
Publication date: 2011